Robust Measurement and Comparison of Context Similarity for Finding Translation Pairs

Authors

  • Daniel Andrade
  • Tetsuya Nasukawa
  • Jun'ichi Tsujii
Abstract

In cross-language information retrieval it is often important to align words that are similar in meaning in two corpora written in different languages. Previous research shows that using context similarity to align words is helpful when no dictionary entry is available. We suggest a new method which selects a subset of words (pivot words) associated with a query and then matches these words across languages. To detect word associations, we demonstrate that a new Bayesian method for estimating Point-wise Mutual Information provides improved accuracy. In the second step, matching is done in a novel way that calculates the chance of an accidental overlap of pivot words using the hypergeometric distribution. We implemented a wide variety of previously suggested methods. Testing in two conditions, a small comparable corpora pair and a large but unrelated corpora pair, both written in disparate languages, we show that our approach consistently outperforms the other systems.
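
The two steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `pmi` function below is the plain maximum-likelihood estimate rather than the Bayesian estimate the abstract refers to, and all function and parameter names (`pmi`, `overlap_p_value`, `population`, `query_pivots`, `candidate_pivots`, `overlap`) are hypothetical. The second function assumes that the chance of an accidental overlap is taken as the upper tail of a hypergeometric distribution, so a smaller value suggests the overlap is less likely to be a coincidence.

```python
from math import comb, log


def pmi(joint_count, count_x, count_y, total):
    """Maximum-likelihood pointwise mutual information.

    Assumption: this is the textbook estimate log(p(x,y) / (p(x) p(y))),
    not the Bayesian estimate mentioned in the abstract.
    """
    if joint_count == 0:
        return float("-inf")
    return log(joint_count * total / (count_x * count_y))


def overlap_p_value(population, query_pivots, candidate_pivots, overlap):
    """Probability of at least `overlap` shared pivot words by chance.

    Assumption: the candidate's pivot words are modeled as a draw of
    `candidate_pivots` words without replacement from `population`
    words, of which `query_pivots` belong to the query.
    """
    upper = min(query_pivots, candidate_pivots)
    total = comb(population, candidate_pivots)
    tail = sum(
        comb(query_pivots, i)
        * comb(population - query_pivots, candidate_pivots - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 1000 dictionary words, 20 pivots for the query,
# 25 pivots for a candidate translation, 8 of them shared.
print(overlap_p_value(1000, 20, 25, 8))
```

Under these assumptions, candidate translations could be ranked by ascending `overlap_p_value`, with the PMI scores used beforehand to decide which context words count as pivots.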

Similar resources

Challenges in Translation Teaching in the Iranian Context: An Ethnographic Study

Many studies can be found on translation teaching and students' perceptions of different classroom practices. However, few studies have focused on how translator trainers view the task at hand. In the present study, the researchers aimed to explore the challenges translator trainers faced during the act of teaching translation. The method applied in this research was ethnography. The pa...

An Empirical Comparison of Distance Measures for Multivariate Time Series Clustering

Multivariate time series (MTS) data are ubiquitous in science and daily life, and measuring their similarity is a core part of the MTS analysis process. Many of the research efforts in this context have focused on proposing novel similarity measures for the underlying data. However, with the countless techniques to estimate similarity between MTS, this field suffers from a lack of comparative...

MetaMorpho TM: A Rule-Based Translation Corpus

This paper discusses aspects of bilingual resource processing within a rule-based translation memory (TM) system currently being developed. Translation memories can be viewed as translation tools incorporating parallel corpora, mainly aligned at the sentence level. Usually, these corpora have no linguistic annotation, as commercial TM systems perform queries at the character level, using f...

Semantic Similarity Measure for Pairs of Short Biological Texts

Finding the semantic similarity between biological texts, especially short texts such as article abstracts and experiment descriptions of microarrays, may provide important information to experts in that field. To date, these methods have not been widely explored. In this paper, a comparison of different measures to calculate the semantic similarity of pairs of short biological texts is presente...

Educating the Future Workforce: Soft Skills Development in Undergraduate Translation Programs in Iran

The present study set out to investigate the concept of soft skills in academic translator education in Iran. To this end, a needs assessment was conducted with two groups of stakeholders: translation professionals and translation students. The professionals were asked to indicate the importance of a set of soft skills in the context of the translation profession. Next, we examined through...

Publication date: 2010